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ABSTRACT 

Two-dimensional electrostatic simulation codes using 
the particle-in-cell model are developed on the Mas- 
sively Parallel Processor (MPP). The conventional 
plasma simulation procedure that computes electric 
fields at particle positions by means of a gridded sys- 
tem is found inefficient on the MPP. The MPP simula- 
tion code is thus based on the gridless system in which 
particles are assigned to processing elements and elec- 
tric fields are computed directly via Discrete Fourier 
Transform. Currently the gridless model on the MPP 
in two dimensions is about nine times slower than the 
gridded system on the CRAY X-MP without consid- 
ering I/O time. However, the gridless system on the 
MPP can be improved by incorporating a faster I/O 
between the staging memory and Array Unit and a 
more efficient procedure for taking floating point sums 
over processing elements. Our initial results suggest 
that the parallel processors have the potential for per- 
forming large scale plasma simulations. 

Keywords: plasma simulation, gridded system, grid- 
less system, discrete Fourier transform 

INTRODUCTION 

Plasma simulations have been used extensively in 
space physics and fusion energy research. Although 
the basic physical principles of plasma physics are well 
understood, the coupled physical phenomena are so 
complex that analytical studies cannot be easily car- 
ried out. With the advent of powerful supercomputers, 
computer simulations have been used to gain insight 
into physical phenomena. 

One class of plasma simulation model is the particle- 
in-cell (PIC) model, which treats the plasma at the 
microscopic level by following the motion of a large 
number of particles. We have been using a particle-in- 
cell code to examine the instabilities of natural elec- 
tron beams observed by the NASA Dynamics Explorer 
satellite at high altitudes (> 10, 000 km) (Ref. 1). The 
simulation code typically uses over 400,000 simula- 
tion particles and a 32 x 32 spatial grid in the two- 
dimensional problem. It is difficult to improve the 


spatial resolution with our available resources, because 
a finer spatial grid would sharply increase the simu- 
lated particle number and would require unreasonable 
computer time. Likewise, three-dimensional simula- 
tions are beyond our capability at the present time. 
Buzbee (Ref. 2) examined the parallel properties of 
particle codes for fusion studies and concluded that 
a large percentage of plasma simulation programs can 
be processed in parallel. We investigated potential im- 
provement in two-dimensional and three-dimensional 
plasma simulations by carrying out simulations on the 
Massively Parallel Processor (MPP). 

Particle-in-cell simulation models have been developed 
for the parallel-architecture CHI computer by Dawson 
et al. (Ref. 3). Their results indicate that the system 
is cost-effective and offers a significant improvement 
in performance. The CHI plasma simulation system 
consists of six microprocessors: one array processor, 
one macro- processor and four I/O processors. The ar- 
ray processor does most of the calculations, the I/O 
processors move data around and the macro-processor 
handles scheduling and control. The parallel process- 
ing among these processors accounts for most of the 
performance advantage of the system. Buzbee (Ref. 
2) has implemented a particle-in-cell code on two par- 
allel processing devices-the UNIVAC 1100/84 and the 
Denelcor Heterogeneous Element Processor (HEP); re- 
sults from these two parallel processors demonstrated 
that a large percentage of the total computation can 
be done in parallel. 

For these previous simulation models, the number of 
parallel processors is less than eight. Plasma simu- 
lation models have not been developed for a single- 
instruction-stream and multiple-data-stream (SIMD) 
processor like the MPP which has 16,384 arithmetic 
processors configured as a 128 x 128 array. Without 
knowing how to take advantage of the unique struc- 
ture of the MPP, we first attempted to transform our 
current particle code to run on the MPP. As might 
be expected, we encountered difficulties in achiev- 
ing good performance on the MPP using the particle 
code optimized for the vector computer CRAY. Before 
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discussing the difficulties and the new approach, we 
present some background on the procedures of plasma 
simulation models. 

PARTICLE-IN-CELL MODEL 

The particle-in-cell simulation code usually represents 
the plasma as a large number of computational par- 
ticles (usually greater than 100,000 particles for each 
species) which move according to classical mechanics 
in the self-consistent electromagnetic fields. The two- 
dimensional spatial system is then divided into fixed 
spatial cells or grids on which charge densities, poten- 
tials, and fields are defined. For the purposes of il- 
lustration, we will discuss only the electrostatic model 
which has no electric current density. In its simplest 
form, the procedures of the plasma simulation are: 

1. computation of the charge density at cell cen- 
ters. A particle is assumed to have a finite size 
comparable to the cell size to reduce numerical 
noise. The charge density of each cell is calcu- 
lated by accumulating each particle’s contribu- 
tion according to its occupied area in the cell. 

2. computation of electric fields from the charge 
density using Poisson’s equation. Poisson’s 
equation is usually solved by using a finite- 
difference scheme or Fast Fourier Transform. 

3. interpolation of the electric forces on the parti- 
cles from the electric fields at the nearest grid 
points. 

4. application of the electric force to advance the 
particle velocities and then positions in time us- 
ing a simple leapfrog scheme. 

The mathematical formulas of the electrostatic model 
are briefly described here. The model consists of finite- 
size particles, moving in a uniform and constant mag- 
netic field B and interacting via a self-consistent elec- 
tric field E. The equations of motion are 


dxJi/dt 

= (qi/mi)(E+ — x B) 
c 

(1) 

dfi/dt 

= Vi 

(2) 


Here m, and are the velocity, mass, and charge of 
the ith particle with the center position denoted by rj. 
The particles have a rectangular shape with a width 
A. The charge density p is then 

P(r) = 3) (3) 


The shape function S(r- rj) is 1 when both (x — X{) 
and ( y - y { ) are negative, and is 0 otherwise. Here x 
and y are the two coordinates of r. Poisson’s equation 
is 

V 2 <j> - -4tt p(r) (4) 

where <j> is the electric potential. The electric field E 
is defined as 

E = - V0 (5) 

We will focus on the numerical method of obtaining 
the electric field E, which is the key problem in im- 
plementing plasma simulation models on parallel com- 
puters. The algorithm of solving the electric field E 
from Equations (4) and (5) is usually based on Fourier 
Transforms. For an infinite continuous system, the 
Fourier Transforms of Equations (4) and (5) yield 

E(k) = -i4*kS(k)p(k)/k 2 (6) 

where E , S(k) and p{k) are the transformed quanti- 
ties. The infinite continuous transform is defined here 
as 

p{k) = J drp{r)e- ik ? (7) 

An infinite continuous system has no grid; therefore, 
the charge density p{k) becomes a summation over the 
particle positions 

p{k) = S(k) ^ qie~ ik r ~' (8) 

i 

When the particle number is large, p(k) in Equation 
(8) is generally too inefficient for computation. In- 
stead a fast algorithm involving a gridded system and 
Fast Fourier Transform is often used. Figure 1 shows 
the flow chart of computing electric field E(fl) at the 
location of particles r\ in a gridded system. Because 
the Fast Fourier Transform of p(k) is computed from 
a small number of p(r* g ) at the cell centers r^, the al- 
gorithm is very efficient. The saving in computation 
makes up for the extra steps in collecting cell charge 
densities and interpolating the electric field at particle 
positions. 

Gridded System On the MPP 

Plasma simulation models using grids have been tested 
on several parallel computers and found to be feasible 
(Refs. 2-3). Our first plasma simulation model on the 
MPP therefore uses the gridded system. The program 
assigns a particle to each processing element in the 
MPP. Because the Array Unit has 16,384 processing 
elements, the program updates positions and veloci- 
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Figure 1. Flow chart of computing electric fields in 
the gridded and gridless systems. 

ties of 16,384 particles at the same time. However, 
we soon realized that collection of cell charge density 
and interpolation of electric field at particle positions 
cannot be computed in parallel or vectorized. 

We investigated the possibility of performing serial 
computations on the VAX, the frontend computer of 
the MPP. The serial calculations involved in comput- 
ing charge density at cell centers (Step 1) were com- 
pleted on the average in 7 seconds on the MPP and 
14.5 seconds on the VAX. A separate timing of the 
number crunching portion on the MPP yields only 
1.2 milliseconds, corresponding to a speed of approx- 
imately 230 Mflops (millions of floating point opera- 
tions). Therefore the MPP time of performing Step 
1 is essentially I/O time between the staging memory 
and the VAX. This approach is clearly not acceptable 
because of the excessive I/O time. 

The problem is somewhat alleviated on some super- 
computers, which can perform fast serial computation 
or do indirect indexing (Ref. 4). Some speedup in 


charge collection on the MPP might be achieved by us- 
ing an efficient sorting algorithm. However, the inter- 
polation cannot be performed efficiently on the MPP 
because it has only nearest-neighbor communication 
and a small memory in the Array Unit. We finally 
discarded the gridded system for the MPP and inves- 
tigated the alternative approach described below. 

GRIDLESS SYSTEM 

The algorithm for the gridless system as shown in Fig- 
ure 1 is much simpler than the gridded system. Equa- 
tions (6) and (8) are used directly to compute the 
Fourier Transform of electric field E(k). No interpo- 
lation is needed to obtain the electric field at parti- 
cle positions. Instead, the electric field at the parti- 
cle position is computed by using the inverse Discrete 
Fourier Transform 

E{x,y) = dk x dk y E{k x ,k y y( k * x + k ^ 

JO 

= YsE™ ei{2 ' IL){ - nx+my) ( 9 ) 

n,m 

Numerical evaluation of the integral in Equation (9) is 
inefficient. The integral is thus converted to discrete 
summations by assuming that the system is periodic 
and has a length of L . The summation is truncated by 
keeping only the low order terms; we have only kept 
terms with n < 16 and m < 16 in two dimensions. 
Since the right hand side of Equation (9) depends on 
only the local parameters (particle position), the cal- 
culation of E(z,y) can be parallelized. 

On the vector computer, the algorithm for the gridless 
system is much slower than that for the gridded sys- 
tem. Figure 2 shows the timing of the algorithms run 
on the CRAY X-MP at the San Diego Supercomputer 
Center. The algorithm of computing electric fields for 
the gridless and gridded systems in one dimension is 
shown in Figure 1. For particle number N = 16384, 
the gridless system requires about 15 times more CPU 
time than the gridded system to obtain comparable 
results on the CRAY X-MP. The reason for this re- 
markable difference in speed is simply that the gridless 
system uses a Discrete Fourier Transform whereas the 
gridded system uses a Fast Fourier Transform. The 
number of operations to be performed for the gridless 
system is proportional to NNk , where JV* is the num- 
ber of Fourier modes. For the gridded system the num- 
ber of operations is proportional to N g log N g , where 
N g is the number of cells and is usually taken to be 
32 for a model. Since Nk is chosen to to be N g /2 and 
N N gt the gridless system has many more arith- 
metic operations than the gridded system. 
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Figure 2. Algorithm timing of electric field calcula- 
tions in one dimension. 

We next discuss the algorithm timing for the grid- 
less system run on the MPP. The MPP CPU time we 
present includes only processing times of the Array 
Unit for arithmetic operations and the Main Control 
Unit for controlling I/O. The I/O time between the 
staging memory and the Array Unit is not yet included 
in our algorithm timing because we have not yet opti- 
mized the I/O and learned how to determine the I/O 
time accurately. It appears that the I/O time is many 
times the MPP CPU time. 

In CPU time required to compute electric field in the 
one-dimensional gridless system, the MPP is compa- 
rable to the CRAY X-MP in speed (Figure 2). The 
MPP is more efficient than the CRAY X-MP for the 
gridless system because it operates on 16,384 particles 
simultaneously and thus the number of operations is 
proportional to (N/l$384)N k . Note that the gridless 
system algorithm is highly parallelized on the MPP. 


We next compare the timing of computing electric field 
in two dimensions between the gridless system on the 
MPP and the gridded system on the CRAY X-MP 
(Figure 3). The MPP is about 24 times slower than 
the CRAY X-MP in the case of 16,384 particles. When 
N = 3 x 16384, the MPP is about 16 times slower than 
the CRAY X-MP. 



Figure 3. Timing comparison of two-dimensional elec- 
trostatic simulation codes between the CRAY X-MP 
and the MPP. Both the gridded and gridless systems 
are timed on the CRAY X-MP. Only the gridless sys- 
tem is timed on the MPP. 

In breaking down the CPU timing on the MPP, we 
found that the MPP spends about 60% CPU time for 
N = 16384 and 30% CPU time for N = 3 x 16384 in 
obtaining the floating point reduction sum, which is 
a summation taken over 16,384 processing elements. 
The amount of time spent on reduction sums is ex- 
cessive since the number of reduction sums is quite 
small, only 16 for one dimension and 256 for two di- 
mensions. We were able to determine that the routine 
in the MPP library for summing 32 bit floating point 
numbers over the processing elements had not been op- 
timized. Specifically, a summation of a 32 bit parallel 
array took about 2.04 milliseconds, which is about 55 
times the CPU time for addition of two 32 bit floating 
point numbers. 

The plasma simulation procedure outlined in the 
particle-in-cell model includes advancing particle posi- 
tions and velocities in addition to solving the electric 
field. Particle positions are checked to determine if 
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particles advance outside the system boundary. The 
periodic boundary condition is used to adjust the po- 
sitions of particles advanced outside the boundary 

x - x - L if x > L (10) 

and 

y= y-L if y > L (11) 

The procedure of implementing the boundary condi- 
tion cannot be fully vectorized on the CRAY X-MP. 
Furthermore, the MPP has an advantage over CRAY 
X-MP because it can advance 16,384 particles simulta- 
neously instead of 64 particles as processed by CRAY 
X-MP. 

Figure 3 presents the timing for the two-dimensional 
electrostatic simulation procedure outlined in the 
particle-in-cell model. The CRAY X-MP uses 0.2 sec- 
onds CPU time for simulating a plasma of 16,384 par- 
ticles, about three times the CPU time needed to com- 
pute the electric field. Thus the CRAY X-MP spends 
about one-third of the CPU time on Steps 1-3 and 
two-thirds of the CPU time on Step 4. In contrast, 
the additional CPU time needed for Step 4 on the 
MPP is negligible (« 1 millisecond). 

For N = 16384, the timing ratio of the two- 

dimensional electrostatic PIC code is about nine. The 
ratio remains constant when N increases to 3 x 16384. 


DISCUSSION 

The objective of this study is to develop an efficient 
MPP program to simulate beam plasma interactions 
in three dimensions. The initial results presented here 
show some difficulties in reaching this goal. The effi- 
cient method typically employed on the CRAY X-MP 
simulates plasmas in a gridded system, first computing 
electric fields at the grid points via Fast Fourier Trans- 
forms and then interpolating electric fields at particle 
positions. Because the MPP has only nearest-neighbor 
communication and a limited memory in the Array 
Unit, the gridded system is awkward and very slow 
on the MPP. We have thus adopted a gridless system 
that assigns a particle to a processing element and 
computes electric fields at particle positions directly 
via Discrete Fourier Transform. Currently, the grid- 
less model in two dimensions on the MPP is about nine 
times slower than the gridded system on the CRAY X- 
MP. Improvements in the CPU times for the MPP are 
still possible, since our MPP programs have not been 
fully optimized. In three-dimensional simulations, the 
MPP is expected to be much slower than the CRAY 
X-MP. 


It is obvious that the speedup factor for the gridless 
model on the MPP relative to the CRAY X-MP is 
large since the gridless model is highly parallelized. 
Indeed, when the performance of the one-dimensional 
gridless system on both computers is compared, the 
MPP is as fast as the CRAY X-MP (see Figure 2). 
However, a much more efficient method using the grid- 
ded system can be adapted on the CRAY X-MP but 
not the MPP. 

So far we have only tested the basic ideas of the 
gridless simulation model and shown that the unique 
structure of the MPP is suitable for processing this 
model. We have not yet completed the program for 
conducting plasma simulations. One critical area un- 
resolved is the transfer of diagnosis quantities from the 
MPP to the VAX for post processing. In order to min- 
imize the total run time, the program being designed 
will process I/O between the staging memory and the 
VAX in parallel with the Array Unit (Figure 4). The 
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Figure 4. Program design of the plasma simulation 
code 


189 










simulation diagnosis usually needs to examine only a 
portion of particles for about every hundred time steps 
or longer. We estimate that the MPP uses about 2.5 
seconds to transfer positions and velocities of 16,384 
particles. Simulating 16,384 particles in one hundred 
time steps, the Array Unit will spend about 200 sec- 
onds, which is more than enough time to output the 
diagnosis results. The program will check to see that 
the I/O is completed before computations continue. 
These features of our program should ensure that the 
I/O time between the staging memory and the VAX 
will not contribute significantly to the total run time. 

We have compared only the CPU time between the 
MPP and the CRAY X-MP. As mentioned before, the 
I/O time between the staging memory and the Array 
Unit is not yet timed. Since the I/O is very slow, the 
MPP performance is misleading without considering 
the I/O. By simulating only a small number of par- 
ticles (N < 49152), we do not need to use staging 
memory. Plasma simulation for realistic problems will 
have many more particles than were used here. Be- 
cause the Array Unit has a very limited memory (32 
floating point words), it is necessary to transfer tempo- 
rary variables to the staging memory. Our preliminary 
estimates indicate that tansferring a parallel array be- 
tween the staging memory and the Array Unit takes 
from 35 to 200 milliseconds. The I/O time is very 
slow and highly variable because the I/O initiated by 
the standard MPP routines is controlled by the fron- 
tend VAX computer. Recently we learned that the 
I/O time can be reduced by using efficient routines 
without involving the VAX. But the number of I/O 
operations for using the staging memory as auxilliary 
memory for the Array Unit is estimated to be at least 
3000 for N = 3 x 16384. Unless the I/O speed is im- 
proved significantly, we do not believe that the MPP 
can compete with the CRAY computers in speed. 

Another difficulty with the plasma simulation on the 
MPP is the floating point reduction sum over 16,384 
processing elements. We were unable to investigate an 
efficient algorithm for computing the reduction sum, 
but were told that the current algorithm can be im- 
proved considerably (private communication with E. 
Seiler). 

Having mentioned the disadvantage of the MPP, we 
feel it is only fair to discuss some advantages of the 
gridless model on the MPP. Implementation of the 
simulation procedure is simplified by assigning a par- 
ticle to a processing element (see flow chart in Figure 
1). Without using the grid, the simulation code also 
avoids numerical noises due to the grid. Finally, little 
effort is needed to extend the code to three dimensions. 


The experience of developing plasma simulation codes 
on the MPP has proved to be more challenging than 
anticipated. Our original attempt, implementing the 
conventional particle-in-cell code on the MPP, was 
fruitless. Because we assigned particles to process- 
ing elements, we encountered the difficulty of indirect 
addressing in the interpretation of forces at particle 
posistions from the nearest-neighbor grid points. The 
MPP currently cannot address indirectly other pro- 
cessing elements. Indirect addressing can be avoided 
in the gridded system if the cells are mapped to pro- 
cessing elements (Ref. 5). 

We have investigated the gridless system approach 
that uses extensive parallel computation and mini- 
mal communication among parallel processors. Our 
results so far indicate that the MPP, with improve- 
ments in both hardware and software, might be ac- 
ceptable for large scale scientific computation. The 
needed improvements include a faster I/O between the 
staging memory and the Array Unit, more memory in 
the Array Unit, and a more efficient procedure for the 
reduction sum. With these improvements the gridless 
system of particle simulations has the potential to be a 
useful and cost-effective method for simulating plasma 
phenomena on parallel computers. 
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